We will perform data cleaning, preparation and visualization on the World Happiness Report dataset.
The focus of this study is to see how happiness scores vary across countries and over time, and which factors contribute most to them.
# import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
# to save plotly interactive plots as html files
# importing two datasets: one contains data for multiple years, while the other contains data for the year 2021
# we will later merge the datasets
df_allyears = pd.read_csv('datasets/world-happiness-report.csv')
df_2021 = pd.read_csv('datasets/world-happiness-report-2021.csv')
print(f"Dataset contains: {len(df_allyears)} rows\n")
df_allyears.head(3)
Dataset contains: 1949 rows
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.8 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 |
| 1 | Afghanistan | 2009 | 4.402 | 7.540 | 0.552 | 51.2 | 0.679 | 0.190 | 0.850 | 0.584 | 0.237 |
| 2 | Afghanistan | 2010 | 4.758 | 7.647 | 0.539 | 51.6 | 0.600 | 0.121 | 0.707 | 0.618 | 0.275 |
# let's check which years are covered in this dataset
df_allyears['year'].unique()
# df_allyears['Country name'].nunique()
array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
2019, 2007, 2020, 2006, 2005], dtype=int64)
# checking if there are any null values
def check_null(df):
    for col in df.columns:
        # isnull().mean() gives the fraction of missing values; multiply by 100 for a percentage
        pct = df[col].isnull().mean() * 100
        print(f'{col} --- {pct:.2f}% null values')

check_null(df_allyears)
Country name --- 0.00% null values
year --- 0.00% null values
Life Ladder --- 0.00% null values
Log GDP per capita --- 1.85% null values
Social support --- 0.67% null values
Healthy life expectancy at birth --- 2.82% null values
Freedom to make life choices --- 1.64% null values
Generosity --- 4.57% null values
Perceptions of corruption --- 5.64% null values
Positive affect --- 1.13% null values
Negative affect --- 0.82% null values
There are missing values in some of the columns, but since even the worst case (Perceptions of corruption, at about 5.6% of rows) is a small share of the data, we can move forward.
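As a quick sanity check, the same information can be read off with `isnull().sum()` (absolute counts) and `isnull().mean()` (fractions). A minimal sketch on a toy DataFrame, with placeholder column names standing in for df_allyears:

```python
import numpy as np
import pandas as pd

# toy frame with one deliberate gap, standing in for df_allyears
toy = pd.DataFrame({
    'Score': [3.7, np.nan, 4.8],
    'Generosity': [0.17, 0.19, 0.12],
})

# absolute counts of missing values per column
print(toy.isnull().sum())

# fraction of missing values per column, as used inside check_null
print(toy.isnull().mean())
```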
# checking datatypes
df_allyears.dtypes
Country name                         object
year                                  int64
Life Ladder                         float64
Log GDP per capita                  float64
Social support                      float64
Healthy life expectancy at birth    float64
Freedom to make life choices        float64
Generosity                          float64
Perceptions of corruption           float64
Positive affect                     float64
Negative affect                     float64
dtype: object
# lets bring in the second dataset so we can merge the two
print(f'Dataset contains: {len(df_2021)} rows\n')
df_2021.head(3)
Dataset contains: 149 rows
| | Country name | Regional indicator | Ladder score | Standard error of ladder score | upperwhisker | lowerwhisker | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Ladder score in Dystopia | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 7.842 | 0.032 | 7.904 | 7.780 | 10.775 | 0.954 | 72.0 | 0.949 | -0.098 | 0.186 | 2.43 | 1.446 | 1.106 | 0.741 | 0.691 | 0.124 | 0.481 | 3.253 |
| 1 | Denmark | Western Europe | 7.620 | 0.035 | 7.687 | 7.552 | 10.933 | 0.954 | 72.7 | 0.946 | 0.030 | 0.179 | 2.43 | 1.502 | 1.108 | 0.763 | 0.686 | 0.208 | 0.485 | 2.868 |
| 2 | Switzerland | Western Europe | 7.571 | 0.036 | 7.643 | 7.500 | 11.117 | 0.942 | 74.4 | 0.919 | 0.025 | 0.292 | 2.43 | 1.566 | 1.079 | 0.816 | 0.653 | 0.204 | 0.413 | 2.839 |
Looks like this dataset contains more columns than the multi-year one.
We will have to drop the extra columns, as these are only present for 2021 and not for all years.
Note: the columns Positive affect and Negative affect are not present in df_2021.
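The column differences between the two datasets can also be listed programmatically with set arithmetic (in practice, `set(df_allyears.columns) - set(df_2021.columns)` and vice versa). A small sketch using a few of the column names shown above:

```python
# abbreviated column sets, taken from the tables printed above
allyears_cols = {'Country name', 'year', 'Life Ladder', 'Positive affect', 'Negative affect'}
cols_2021 = {'Country name', 'Ladder score', 'upperwhisker', 'lowerwhisker'}

# columns present in the multi-year data but absent from 2021
print(allyears_cols - cols_2021)

# columns present only in the 2021 data
print(cols_2021 - allyears_cols)
```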
# before we specify the columns we want, we will add a new column 'year' so that we can add it during merge
df_2021['Year'] = 2021
# specifying the columns we want in the dataframe
# df_allyears.columns  # handy reference for copying the shared column names
df_2021 = df_2021[['Country name', 'Regional indicator', 'Year', 'Ladder score', 'Logged GDP per capita',
'Social support', 'Healthy life expectancy',
'Freedom to make life choices', 'Generosity',
'Perceptions of corruption',]]
df_2021.head(3)
| | Country name | Regional indicator | Year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 2021 | 7.842 | 10.775 | 0.954 | 72.0 | 0.949 | -0.098 | 0.186 |
| 1 | Denmark | Western Europe | 2021 | 7.620 | 10.933 | 0.954 | 72.7 | 0.946 | 0.030 | 0.179 |
| 2 | Switzerland | Western Europe | 2021 | 7.571 | 11.117 | 0.942 | 74.4 | 0.919 | 0.025 | 0.292 |
# lets check if df_2021 contains any null values
check_null(df_2021)
Country name --- 0.00% null values
Regional indicator --- 0.00% null values
Year --- 0.00% null values
Ladder score --- 0.00% null values
Logged GDP per capita --- 0.00% null values
Social support --- 0.00% null values
Healthy life expectancy --- 0.00% null values
Freedom to make life choices --- 0.00% null values
Generosity --- 0.00% null values
Perceptions of corruption --- 0.00% null values
Great! No missing values in any of the columns!
# before we merge, let's rename columns on both df's for ease of merging
df_allyears.rename(columns={'Country name': 'Country', 'year': 'Year', 'Life Ladder': 'Score',
'Healthy life expectancy at birth': 'Healthy life expectancy',
'Log GDP per capita': 'GDP score', 'Freedom to make life choices': 'Freedom'},
inplace=True)
df_2021.rename(columns={'Country name': 'Country', 'Ladder score': 'Score', 'Regional indicator': 'Region',
'Logged GDP per capita': 'GDP score', 'Freedom to make life choices': 'Freedom'},
inplace=True)
# merging
# I could not find a better way to update the region values, hence I will be creating a merge
# just to get the region values for all the countries
temp_reg = pd.merge(df_allyears, df_2021, how='outer', on='Country')['Region']
df_allyears['Region'] = temp_reg
eda_happy = pd.merge(df_allyears, df_2021, how='outer',
on=['Country', 'Year', 'Score', 'GDP score','Social support',
'Healthy life expectancy', 'Freedom', 'Generosity',
'Perceptions of corruption', 'Region'])
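Incidentally, a less roundabout way to attach the regions than the temp_reg merge above is a `Series.map` lookup, assuming each country appears exactly once in df_2021. A minimal sketch on toy stand-ins for the two frames:

```python
import pandas as pd

# toy stand-ins for df_allyears and df_2021
df_allyears_toy = pd.DataFrame({'Country': ['Finland', 'Finland', 'Denmark'],
                                'Year': [2019, 2020, 2020]})
df_2021_toy = pd.DataFrame({'Country': ['Finland', 'Denmark'],
                            'Region': ['Western Europe', 'Western Europe']})

# build a Country -> Region lookup and map it onto every row
region_lookup = df_2021_toy.set_index('Country')['Region']
df_allyears_toy['Region'] = df_allyears_toy['Country'].map(region_lookup)
print(df_allyears_toy)
```

This avoids relying on the row ordering of the merge result, since `map` matches on the country value itself.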
# rearranging columns
r_list = ['Region', 'Country', 'Year', 'Score', 'GDP score', 'Social support', 'Healthy life expectancy', 'Freedom',
'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect']
eda_happy = eda_happy[r_list]
eda_happy.head(3)
| | Region | Country | Year | Score | GDP score | Social support | Healthy life expectancy | Freedom | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | South Asia | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.8 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 |
| 1 | South Asia | Afghanistan | 2009 | 4.402 | 7.540 | 0.552 | 51.2 | 0.679 | 0.190 | 0.850 | 0.584 | 0.237 |
| 2 | South Asia | Afghanistan | 2010 | 4.758 | 7.647 | 0.539 | 51.6 | 0.600 | 0.121 | 0.707 | 0.618 | 0.275 |
Great! Now our dataset is ready!
# to create a choropleth graph we need country codes
# luckily, our previous project, Plastic Pollution, already includes country codes
# let's merge these codes into our eda_happy dataset
con_codes = pd.read_csv('datasets/per-capita-plastic-waste-vs-gdp-per-capita.csv')
con_codes.rename(columns={'Entity': 'Country'}, inplace=True)
con_codes = con_codes[['Country', 'Code']].drop_duplicates()
print(con_codes.head(3))
eda_happy = pd.merge(eda_happy, con_codes, how='left', on='Country')
eda_happy.sort_values(by='Year', inplace=True)
                   Country      Code
0              Afghanistan       AFG
220                 Africa       NaN
342  Akrotiri and Dhekelia  OWID_AKD
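Since this is a left merge, any country missing from the codes file ends up with a NaN `Code` and would silently drop off the choropleth. One way to audit for that, sketched on toy stand-ins for eda_happy and con_codes:

```python
import pandas as pd

# toy stand-ins for eda_happy and con_codes
happy_toy = pd.DataFrame({'Country': ['Finland', 'Kosovo'], 'Score': [7.8, 6.4]})
codes_toy = pd.DataFrame({'Country': ['Finland'], 'Code': ['FIN']})

merged = pd.merge(happy_toy, codes_toy, how='left', on='Country')

# countries that did not pick up a code (these would be invisible on the map)
missing = merged.loc[merged['Code'].isnull(), 'Country'].unique()
print(missing)
```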
# since we have missing score values for most of the countries for the years 2005 & 2006,
# we will be mapping from 2007 to see the change in levels for all the countries.
# Though, if you want to see the levels from 2005, you can change the dataframe parameter to : eda_happy
eda_mapping = eda_happy[(eda_happy['Year'] != 2005) & (eda_happy['Year'] != 2006)]
px.choropleth(eda_mapping, locations='Code', color='Score', hover_name='Country',
animation_frame='Year', color_continuous_scale=px.colors.sequential.Plasma,
projection='natural earth', title="Happiness Levels in Countries From 2007 - 2021",
template='seaborn', range_color=[2, 7])
# figure.update_layout(paper_bgcolor = '#2e3141', font_color='white')
We will see what factors led to this!
np.mean(eda_happy['Score'])
# if we round this score we will get 5
5.471402287893232
%matplotlib inline
# To see the top 10 highest and lowest countries we need to create plot data differently
top_happy = eda_happy.groupby('Country', as_index=False)['Score'].mean().sort_values(
by='Score',ascending=False)[:10]
top_unhappy = eda_happy.groupby('Country', as_index=False)['Score'].mean().sort_values(
by='Score', ascending=True)[:10]
plt.style.use('seaborn')
plt.figure(2, figsize=(12,8))
sns.barplot(data=top_happy, x='Score', y='Country', palette='Blues_d')
plt.xlabel('Score',fontsize=14, fontweight='bold')
plt.ylabel('Country', fontsize=14, fontweight='bold')
plt.title('Top 10 highest happiness scored countries', fontsize=16, fontweight='bold')
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
# fig = plt.gcf()
plt.show()
# fig.savefig('top10happy.jpg')
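The same top/bottom selection can be written a bit more compactly with `nlargest`/`nsmallest` instead of sorting and slicing. A sketch on a toy frame standing in for eda_happy:

```python
import pandas as pd

# toy stand-in for eda_happy with repeated country rows
toy = pd.DataFrame({'Country': ['A', 'A', 'B', 'C'],
                    'Score': [7.0, 8.0, 3.0, 5.0]})

# mean score per country, then pick the top/bottom 2
means = toy.groupby('Country', as_index=False)['Score'].mean()
top = means.nlargest(2, 'Score')
bottom = means.nsmallest(2, 'Score')
print(top)
print(bottom)
```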
We'll see what contributors led to this!
%matplotlib inline
plt.style.use('seaborn')
plt.figure(3, figsize=(12,8))
sns.barplot(data=top_unhappy, x='Score', y='Country', palette='Reds_d')
plt.xlabel('Score',fontsize=14, fontweight='bold')
plt.ylabel('Country', fontsize=14, fontweight='bold')
plt.title('Top 10 lowest happiness scored countries', fontsize=16, fontweight='bold')
plt.xlim(0,8)
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
# plt.axes.grid(color='white')
# ax = plt.gca()
# ax.set_facecolor("#2e3141")
# fig = plt.gcf()
plt.show()
# fig.savefig('top10unhappyt.jpg', bbox_inches='tight')
We'll see what contributors led to this!
# creating a correlation matrix to see which factors contribute the most to happiness levels
# numeric_only=True skips the non-numeric columns (Region, Country, Code)
happy_corr = eda_happy.corr(numeric_only=True)
plt.figure(figsize=(12,8))
plt.style.use('seaborn')
sns.heatmap(happy_corr, annot=True)
plt.xticks(fontsize=11, fontstyle='normal')
plt.yticks(fontsize=11, fontstyle='normal')
plt.title("Factor's Correlation with Happiness Score", fontsize=14, fontweight='bold')
# fig = plt.gcf()
plt.show()
# fig.savefig('happy_corr.jpg', bbox_inches="tight")
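To rank the factors by their relationship with the happiness score, the `Score` column of the correlation matrix can be pulled out and sorted. A sketch on a toy numeric frame standing in for eda_happy's numeric columns:

```python
import pandas as pd

# toy numeric frame standing in for eda_happy's numeric columns
toy = pd.DataFrame({'Score': [3.0, 4.0, 5.0, 6.0],
                    'GDP score': [7.0, 7.5, 8.2, 9.0],
                    'Negative affect': [0.4, 0.35, 0.3, 0.2]})

# correlation of every factor with Score, strongest positive first
corr_with_score = toy.corr()['Score'].drop('Score').sort_values(ascending=False)
print(corr_with_score)
```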
# for this we need to remove any null values
gdp_map = eda_happy.dropna()
figure = px.scatter(gdp_map, x='GDP score', y='Healthy life expectancy', color='Region',
template='plotly_white', hover_name='Country',
animation_group='Country')
figure.update_layout(plot_bgcolor='#2e3141', paper_bgcolor='#2e3141', legend_font_color='lightgray',
font_color='lightgray')
Thus most of the countries in the affected regions, such as Sub-Saharan Africa, South Asia and some countries in Latin America, must focus on their GDP score and on improving the health of their citizens.
This will in turn increase positive affect among their citizens, thus increasing social support and contributing to an overall higher happiness level.
On the other hand, countries with a progressive increase in their GDP score, such as Singapore in the Southeast Asian region, show an increasing healthy life expectancy across multiple years.
This also holds for most of the countries in the Western European region.